Redundant array of independent memory

A redundant array of independent memory (RAIM) is a design feature found in the main random-access memory of certain computers. RAIM uses additional memory modules and striping algorithms to protect against the failure of any particular module and to keep the memory system operating continuously. RAIM is similar in concept to a redundant array of independent disks (RAID), which protects against the failure of a disk drive; in the case of memory, it tolerates the failure of several DRAM devices (chipkill) as well as entire memory channel failures. RAIM is considerably more robust than parity checking and ECC memory technologies, which cannot protect against many varieties of memory failure.
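The RAID analogy can be illustrated with a small sketch. The Python example below stripes a byte string across several data channels plus one XOR parity channel, then rebuilds a channel that has been lost. It is a simplified illustration of the striping-plus-redundancy idea only: the function names, the four-data-channel default, and the use of plain XOR parity are assumptions made for this example and do not describe the actual error-correcting code used in any RAIM implementation.

```python
from functools import reduce
from typing import List


def xor_columns(channels: List[bytes]) -> bytes:
    """XOR the bytes at each position across all given channels."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*channels))


def stripe_with_parity(data: bytes, n_data_channels: int = 4) -> List[bytes]:
    """Stripe data byte-wise over n_data_channels and append one parity channel.

    Hypothetical helper: the channel count and XOR parity are illustrative
    assumptions, not the scheme of a real RAIM memory subsystem.
    """
    # Zero-pad so every data channel holds the same number of bytes.
    padding = (-len(data)) % n_data_channels
    padded = data + bytes(padding)
    # Channel i receives bytes i, i + n, i + 2n, ...
    channels = [padded[i::n_data_channels] for i in range(n_data_channels)]
    return channels + [xor_columns(channels)]


def rebuild_channel(channels: list, failed: int) -> bytes:
    """Recompute the contents of one failed channel from all the survivors."""
    survivors = [c for i, c in enumerate(channels) if i != failed]
    return xor_columns(survivors)


def read_back(channels: List[bytes], length: int) -> bytes:
    """Interleave the data channels (all but the last) and drop the padding."""
    out = bytearray()
    for column in zip(*channels[:-1]):
        out.extend(column)
    return bytes(out[:length])


if __name__ == "__main__":
    payload = b"RAIM protects memory"
    channels = stripe_with_parity(payload)
    channels[2] = None                      # simulate a whole-channel failure
    channels[2] = rebuild_channel(channels, failed=2)
    assert read_back(channels, len(payload)) == payload
```

Single XOR parity of this kind can recover from only one known-failed channel at a time; it cannot by itself locate which channel is faulty, which is one reason real memory subsystems rely on stronger codes.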

On July 22, 2010, IBM introduced the first (and thus far only) high-end computer server featuring RAIM, the zEnterprise 196. Each z196 machine contains up to 3 TB (usable) of RAIM-protected main memory. The formal announcement letter offered additional information about the implementation:

[...] IBM's most robust error correction to date can be found in the memory subsystem. A new redundant array of independent memory (RAIM) technology is being introduced to provide protection at the dynamic random access memory (DRAM), dual inline memory module (DIMM), and memory channel level. Three full DRAM failures per rank can be corrected. DIMM level failures, including components such as the controller application specific integrated circuit (ASIC), the power regulators, the clocks, and the board, can be corrected. Memory channel failures such as signal lines, control lines, and drivers/receivers on the MCM can be corrected. Upstream and downstream data signals can be spared using two spare wires on both the upstream and downstream paths. One of these signals can be used to spare a clock signal line (one upstream and one downstream). Together these improvements are designed to deliver System z's most resilient memory subsystem to date.[1]

References